
    Evidence Transfer for Improving Clustering Tasks Using External Categorical Evidence

    In this paper we introduce evidence transfer for clustering, a deep learning method that incrementally manipulates the latent representations of an autoencoder, according to external categorical evidence, in order to improve a clustering outcome. We define evidence transfer as the process by which the categorical outcome of an external, auxiliary task is exploited to improve a primary task, in this case representation learning for clustering. Our proposed method makes no assumptions regarding the categorical evidence presented, nor the structure of the latent space. We compare our method against a baseline solution by performing k-means clustering before and after its deployment. Experiments with three different kinds of evidence show that our method effectively manipulates the latent representations when introduced with real corresponding evidence, while remaining robust when presented with low-quality evidence.
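    The core idea can be sketched as an autoencoder trained with an auxiliary loss that nudges its latent codes toward the external categorical evidence. Below is a minimal, hypothetical PyTorch sketch of that idea; the layer sizes, the evidence head and the weighting `alpha` are illustrative assumptions, not the authors' implementation:

```python
import torch.nn as nn

class EvidenceTransferAE(nn.Module):
    """Autoencoder with an auxiliary head for external categorical evidence
    (a hypothetical sketch of the evidence-transfer idea)."""
    def __init__(self, input_dim, latent_dim, n_evidence_classes):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))
        # Assumed head mapping latent codes to the external evidence labels.
        self.evidence_head = nn.Linear(latent_dim, n_evidence_classes)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.evidence_head(z), z

def evidence_transfer_loss(x, x_hat, evidence_logits, evidence_labels, alpha=0.1):
    # Reconstruction keeps the representation faithful to the data; the
    # evidence term incrementally manipulates the latent space according to
    # the auxiliary categorical evidence. alpha is an assumed weighting.
    recon = nn.functional.mse_loss(x_hat, x)
    evid = nn.functional.cross_entropy(evidence_logits, evidence_labels)
    return recon + alpha * evid

# After training, the latent codes z would be clustered with k-means, as in
# the paper's before/after evaluation.
```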

    DARE Platform a Developer-Friendly and Self-Optimising Workflows-as-a-Service Framework for e-Science on the Cloud

    The DARE platform, developed as part of the H2020 DARE project (grant agreement No 777413), enables the seamless development and reuse of scientific workflows and applications, and the reproducibility of experiments. Further, it provides Workflow-as-a-Service (WaaS) functionality and dynamic loading of execution contexts in order to hide technical complexity from its end users. This archive includes v3.5 of the DARE platform.

    DARE: A Reflective Platform Designed to Enable Agile Data-Driven Research on the Cloud

    The DARE platform has been designed to help research developers deliver user-facing applications and solutions over diverse underlying e-infrastructures, data and computational contexts. The platform is Cloud-ready and relies on the exposure of APIs, which are suitable for raising the abstraction level and hiding complexity. At its core, the platform implements the cataloguing and execution of fine-grained, Python-based dispel4py workflows as services. Reflection is achieved via a logical knowledge base comprising multiple internal catalogues, registries and semantics, while the platform supports persistent and pervasive data provenance. This paper presents design and implementation aspects of the DARE platform and provides directions for future development.
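    As a concrete illustration of the WaaS pattern, the sketch below registers a workflow in a catalogue and triggers an execution over HTTP. The host name, endpoint paths and payload fields are hypothetical, chosen only to show the interaction style; they are not the documented DARE API:

```python
import requests

BASE = "https://dare-platform.example.org/api"   # placeholder host
headers = {"Authorization": "Bearer <access-token>"}

# Register a dispel4py workflow definition in the platform's catalogue
# (endpoint and schema are assumptions for illustration).
with open("seismic_preprocess.py") as f:
    requests.post(f"{BASE}/workflows", headers=headers,
                  json={"name": "seismic_preprocess", "source": f.read()})

# Launch an execution; in the DARE design, the platform resolves the
# execution context dynamically and records provenance for the run.
run = requests.post(f"{BASE}/workflows/seismic_preprocess/executions",
                    headers=headers,
                    json={"inputs": {"events": "catalogue.xml"}})
print(run.json())
```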

    dispel4py: A Python framework for data-intensive scientific computing

    This paper presents dispel4py, a new Python framework for describing abstract stream-based workflows for distributed data-intensive applications. These combine the familiarity of Python programming with the scalability of workflows. Data streaming is used to gain performance, rapid prototyping and applicability to live observations. dispel4py enables scientists to focus on their scientific goals, avoiding distracting details and retaining flexibility over the computing infrastructure they use. The implementation, therefore, has to map dispel4py abstract workflows optimally onto target platforms chosen dynamically. We present four dispel4py mappings: Apache Storm, Message Passing Interface (MPI), multi-threading and sequential, showing two major benefits: a) smooth transitions from local development on a laptop to scalable execution for production work, and b) scalable enactment on significantly different distributed computing infrastructures. Three application domains are reported, and measurements on multiple infrastructures show the optimisations achieved; these domains have provided demanding real applications and helped us develop effective training. dispel4py (dispel4py.org) is an open-source project to which we invite participation. The effective mapping of dispel4py onto multiple target infrastructures demonstrates exploitation of data-intensive and high-performance computing (HPC) architectures and consistent scalability.
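    A minimal workflow in the style of dispel4py's published examples is sketched below: two processing elements connected into an abstract streaming graph that any of the mappings can then enact. Class and method names follow the project's documentation, but treat them as assumptions, since the API has evolved across versions:

```python
# A two-step streaming workflow sketch in the style of dispel4py's examples.
from dispel4py.core import GenericPE
from dispel4py.workflow_graph import WorkflowGraph

class NumberProducer(GenericPE):
    """Source PE: emits an increasing counter each time it is invoked."""
    def __init__(self):
        GenericPE.__init__(self)
        self._add_output('output')
        self.counter = 0

    def process(self, inputs):
        self.counter += 1
        return {'output': self.counter}

class Doubler(GenericPE):
    """Transforms each item arriving on the stream."""
    def __init__(self):
        GenericPE.__init__(self)
        self._add_input('input')
        self._add_output('output')

    def process(self, inputs):
        return {'output': inputs['input'] * 2}

# The abstract graph is independent of the target platform: the same graph
# can be enacted with the sequential, multi-threaded, MPI or Storm mappings,
# e.g. a local sequential test run via `dispel4py simple <module> -i 10`.
graph = WorkflowGraph()
graph.connect(NumberProducer(), 'output', Doubler(), 'input')
```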

    MRbox: Simplifying Working with Remote Heterogeneous Analytics and Storage Services via Localised Views

    The management, analysis and sharing of big data usually involve interacting with multiple heterogeneous remote and local resources. Performing data-intensive operations in this environment is typically a non-automated and arduous task that often requires deep knowledge of the underlying technical details, which non-experts lack. MapReduce box (MRbox) is an open-source experimental application that aims to lower the barrier of technical expertise needed to use powerful big data analytics tools and platforms. MRbox extends the Dropbox interaction paradigm, providing a unifying view of the data shared across multiple heterogeneous infrastructures, as if they were local. It also enables users to schedule and execute analytics on remote computational resources simply by interacting with local files and folders. MRbox currently supports Hadoop and ownCloud/B2DROP services, on which MapReduce jobs can be scheduled and executed. We hope to further expand MRbox so that it unifies more types of resources, and to explore ways for users to interact with complex infrastructures more simply and intuitively.
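    Illustratively, this interaction paradigm reduces job submission to dropping a file into a synced folder. The sketch below is a hypothetical watcher in that spirit; the folder layout, job-file schema and streaming-jar path are invented for illustration and are not MRbox's actual code:

```python
import json
import subprocess
import time
from pathlib import Path

SYNC_DIR = Path.home() / "mrbox"   # hypothetical localised view of remote data

def submit(job_file: Path) -> None:
    """Submit the MapReduce job described by a dropped job file."""
    job = json.loads(job_file.read_text())
    # Hadoop streaming invocation; input/output paths refer to the remote
    # HDFS namespace that the local folder mirrors.
    subprocess.run([
        "hadoop", "jar", "hadoop-streaming.jar",
        "-input", job["input"], "-output", job["output"],
        "-mapper", job["mapper"], "-reducer", job["reducer"],
    ], check=True)

# Poll the synced folder: each *.job.json file triggers a remote execution,
# then is renamed so it is not submitted twice.
while True:
    for f in SYNC_DIR.glob("*.job.json"):
        submit(f)
        f.rename(f.with_suffix(".done"))
    time.sleep(5)
```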

    An Architecture for Information Retrieval over Semi-Collaborating Peer-to-Peer Networks

    In this paper we study Information Retrieval (IR) over semi-collaborating Peer-to-Peer (P2P) networks. By the term semi-collaborating we mean networks where, although peers have to collaborate in order to achieve overall effectiveness, they do not have to share any proprietary information with the rest of the network, nor do they have to be consistent with respect to the retrieval systems they use. For IR in particular, the potential of widely distributed information pools, with effortless access to them irrespective of the underlying networking protocols, operating systems or devices, is revolutionary. However, various limitations of existing systems have been identified, perhaps the most important being the successful location of relevant information sources and efficient query routing in large, highly distributed P2P networks. In this paper we propose a clustering-based architecture for IR over P2P networks in order to solve this problem. We also investigate the usefulness of a simplified version of the Dempster-Shafer (D-S) theory of evidence combination for the re-ranking of results in the network. Finally, we present the evaluation results we obtained through simulation, in terms of the standard IR precision and recall measures.
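    For concreteness, a textbook form of Dempster's rule of combination is sketched below; the paper's exact simplification is not reproduced here, and the frames and mass values are toy examples:

```python
def combine(m1: dict, m2: dict) -> dict:
    """Combine two mass functions over a frame of discernment using
    Dempster's rule (keys are frozensets of hypotheses)."""
    combined, conflict = {}, 0.0
    for a, w1 in m1.items():
        for b, w2 in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + w1 * w2
            else:
                conflict += w1 * w2   # mass on incompatible hypotheses
    # Normalise by the non-conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Toy example: two evidence sources ranking documents {d1, d2}.
m1 = {frozenset({'d1'}): 0.6, frozenset({'d1', 'd2'}): 0.4}
m2 = {frozenset({'d2'}): 0.5, frozenset({'d1', 'd2'}): 0.5}
print(combine(m1, m2))
# {frozenset({'d1'}): 0.429, frozenset({'d2'}): 0.286, frozenset({'d1','d2'}): 0.286}
```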